Scraping the OLX India website to get used car price data
Figure 1: scraping olx website for cars
Everyone has considered, or will consider, buying a car for various reasons. As for me, browsing through OLX is a hobby and a passion that has been part of my weekend fix for quite some time now. Aimless browsing through classifieds, although joyful, does not help us understand trends and patterns. Therefore, I decided to occasionally scrape the OLX website for used car prices and make visualizations from the data. The primary objective was to have fun, and also to grab some good deals when they present themselves.
In the first step, we load up the packages “tidyverse”, “httr”, and “rvest” to make sure that all the functions we call will work seamlessly. Now, I present to you the function “olxfind”.
library(tidyverse)
library(httr)
library(rvest)

olxfind <- function(area, yearstart, yearend, make) {
  # Build the search link: first-owner cars of one make within a year range
  link <- paste0("https://www.olx.in/", area, "/cars_c84?filter=first_owner_eq_1%2Cmake_eq_", make, "%2Cyear_between_", yearstart, "_to_", yearend)
  # It is important to create a session first or else you may get a 403 error
  page <- link |> session() |> read_html()
  # The class names below are tied to the current OLX page layout
  prices <- page |> html_nodes("._3GOwr") |> html_text()
  yearmileage <- page |> html_nodes(".KFHpP") |> html_text()
  carname <- page |> html_nodes("._4aNdc") |> html_text()
  # pic <- page |> html_attrs("img")
  # Keep everything local to the function (no <<- assignments to the global environment)
  polo <- tibble(prices, yearmileage, carname)
  polo1 <- polo |>
    # "year - mileage" becomes two columns
    separate(yearmileage, into = c("year", "mileage"), sep = " - ") |>
    # Strip the unit, a trailing ".0" and thousands separators from mileage
    mutate(mileage = str_remove_all(mileage, pattern = "km")) |>
    mutate(mileage = str_remove_all(mileage, pattern = "\\.+0")) |>
    mutate(mileage = str_remove_all(mileage, pattern = "[:punct:]")) |>
    # Strip thousands separators from the price, then drop the currency symbol
    mutate(prices = str_remove_all(prices, pattern = "[:punct:]")) |>
    separate(prices, into = c("symbol", "prices"), sep = " ") |>
    select(year, mileage, prices, carname) |>
    mutate(across(c(1:3), as.numeric))
  return(polo1)
}
olxfind(area= "dehradun_g4059236", yearstart = "2014", yearend = "2020",make = "volkswagen")
# A tibble: 40 × 4
year mileage prices carname
<dbl> <dbl> <dbl> <chr>
1 2016 79000 600000 Volkswagen Vento
2 2014 52000 395000 Volkswagen GTI
3 2014 55000 385000 Volkswagen Polo
4 2016 31690 450000 Volkswagen Polo
5 2018 38000 700000 Volkswagen Vento
6 2014 65823 420000 Volkswagen Polo
7 2017 45000 550000 Volkswagen Ameo
8 2018 24000 625000 Volkswagen Polo
9 2020 27000 790000 Volkswagen Polo
10 2019 35000 535000 Volkswagen Polo
# … with 30 more rows
polo1<- olxfind(area= "dehradun_g4059236", yearstart = "2014", yearend = "2020",make = "volkswagen")
class(polo1)
[1] "tbl_df" "tbl" "data.frame"
glimpse(polo1)
Rows: 40
Columns: 4
$ year <dbl> 2014, 2016, 2014, 2016, 2018, 2014, 2017, 2018, 2020…
$ mileage <dbl> 52000, 79000, 55000, 31690, 38000, 65823, 45000, 240…
$ prices <dbl> 395000, 600000, 385000, 450000, 700000, 420000, 5500…
$ carname <chr> "Volkswagen GTI", "Volkswagen Vento", "Volkswagen Po…
polo1<- as_tibble(polo1)
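To see what the cleaning steps inside olxfind do without hitting the website, here is a minimal sketch that runs the same pipeline on two hand-typed listings. The raw strings are an assumption about what the scraped text looks like (a rupee symbol followed by a space, Indian-style thousands separators, and “year - mileage km”); the real page may format them differently.

```r
library(tidyverse)

# Hand-typed strings imitating the ASSUMED raw format of a listing card
raw <- tibble(
  prices      = c("₹ 6,00,000", "₹ 3,95,000"),
  yearmileage = c("2016 - 79,000 km", "2014 - 52,000.0 km"),
  carname     = c("Volkswagen Vento", "Volkswagen Polo")
)

clean <- raw |>
  # Split "2016 - 79,000 km" into year and mileage
  separate(yearmileage, into = c("year", "mileage"), sep = " - ") |>
  mutate(
    mileage = str_remove_all(mileage, "km"),         # drop the unit
    mileage = str_remove_all(mileage, "\\.+0"),      # drop a trailing ".0"
    mileage = str_remove_all(mileage, "[:punct:]"),  # drop the commas
    prices  = str_remove_all(prices, "[:punct:]")    # drop the commas
  ) |>
  # The currency symbol and the number are separated by a space
  separate(prices, into = c("symbol", "prices"), sep = " ") |>
  select(year, mileage, prices, carname) |>
  mutate(across(1:3, as.numeric))

clean
```

If the resulting columns contain NA, the raw strings probably no longer match these patterns, which is a quick hint that the page layout has changed.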
The olxfind function takes four arguments (all strings): area, yearstart, yearend, and make.
Area is one of the most important arguments for this function. You need to specify it accurately if you want to get area-specific results. As the definition of olxfind shows, all the arguments are used primarily to build the page link that will be scraped.
Therefore, before running the function, you should ideally visit OLX and select the area of your choice from the area button. Then copy the string that specifies your region from the URL and pass it as the area argument. For example, if I use only “dehradun” for the area argument, we will get an error. Since OLX lists Dehradun as “dehradun_g4059236”, you need to specify exactly that in the area argument. Suppose you want to search for cars in the Delhi region: the OLX link becomes “https://www.olx.in/delhi_g4058659/cars_c84”. In this case the area code for Delhi is “delhi_g4058659”, and that is what you need to pass as the area argument.
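As a rough illustration of how the area code ends up in the link, the sketch below builds the same URL that olxfind assembles, so it can be inspected before any scraping happens. The helper name olx_url is hypothetical, and the check that an area code ends in “_g” plus digits is only an assumption based on the two examples above.

```r
# Hypothetical helper (not part of olxfind): returns the search link as text
olx_url <- function(area, yearstart, yearend, make) {
  # ASSUMPTION: area codes look like "<city>_g<digits>", as in the examples
  if (!grepl("_g[0-9]+$", area)) {
    stop("area should look like 'dehradun_g4059236'; copy it from the OLX URL")
  }
  paste0(
    "https://www.olx.in/", area,
    "/cars_c84?filter=first_owner_eq_1%2Cmake_eq_", make,
    "%2Cyear_between_", yearstart, "_to_", yearend
  )
}

# Link for first-owner Volkswagens in Delhi, 2014 to 2020
olx_url("delhi_g4058659", "2014", "2020", "volkswagen")
```

Pasting the printed link into a browser is a quick way to confirm that the area code and filters are right before running the scraper.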
Notice that the product category is “cars_c84”, which is hard-coded in the link, so you do not need to modify it from within the function. If you are interested in motorcycles (“motorcycles_c81”) or mobile phones (“mobile-phones_c1453”) instead, you will have to change that part of the link inside the function.
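One possible way to avoid editing the link by hand is a small lookup of category slugs. This is only a sketch: the function name olx_category_url and the product argument are hypothetical, the slugs are the three quoted above, and it is an assumption that other categories follow the same “area/slug” pattern as the Delhi cars link. The car-specific filters are left out because they may not apply to other products.

```r
# Category slugs quoted in the text; the names on the left are my own labels
product_slugs <- c(
  "cars"          = "cars_c84",
  "motorcycles"   = "motorcycles_c81",
  "mobile-phones" = "mobile-phones_c1453"
)

# Hypothetical helper: base link for one area and one product category
olx_category_url <- function(area, product = "cars") {
  # match.arg() stops with an error for categories that are not in the lookup
  product <- match.arg(product, names(product_slugs))
  paste0("https://www.olx.in/", area, "/", product_slugs[[product]])
}

olx_category_url("delhi_g4058659")
olx_category_url("dehradun_g4059236", product = "motorcycles")
```

The CSS selectors and cleaning steps in olxfind were written for car listings, so they would likely need adjusting for other categories as well.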
You can also filter the cars by year of manufacture. This helps you narrow the results down and filter out irrelevant listings. Although year is a numeric variable, it ends up as text here because it is pasted into the link. paste0 converts a number such as 2014 to text for you, so both 2014 and “2014” work, but writing “2014” keeps the call consistent with the other arguments.
The OLX website gives you the option to select cars from various manufacturers. With olxfind you can get data for only one car manufacturer at a time. You can save the data from each call, add the name of the manufacturer as a separate column, and then use “dplyr::bind_rows” to join the results together. This ensures that you get the maximum number of listings from each manufacturer.
Right now, this function cannot parse more than 40 entries per call because of the design of the OLX website. If anyone has an idea how to get all the data points and bypass the “load more” button, please share your insights in the comments section. I am also looking into the possibility of downloading the images associated with each data point to the database. In its present form, the function requires users to tweak a number of things if they want to look for other product types. Later, I might add a “product” argument, with some conditional statements that create the relevant page links for users.
Now that we have the data at hand, we can create some exploratory visualizations from it to better understand the trends and patterns.
library(plotly)
# names(polo1)
plot1<- polo1 |>
mutate(year = as.factor(year)) |>
ggplot(aes(x = year, y = prices, group = year)) +
geom_violin(aes(alpha = 0.001), show.legend = FALSE)+
geom_boxplot(aes()) +
geom_jitter(aes(colour = mileage)) +
stat_summary(fun=mean, geom="point", shape=5, size=4)+
scale_y_continuous(n.breaks = 10) +
theme_bw()
#plot1
plotly::ggplotly(plot1)
This graph reveals that there is some overlap between used car prices across years of manufacture. This can be attributed to factors like the model variant, the number of previous owners and the colour of the car. However, that is a topic for another day; here we will make do with only the variables we have at our disposal. Let’s draw another graph by summarizing the mean prices and mileage of the cars grouped by year.
polo1 |> mutate(year = as.factor(year)) |>
group_by(year) |> summarise(prices = mean(prices)) |> kableExtra::kbl() |> kableExtra::kable_styling(position = "center")
| year | prices |
|---|---|
| 2014 | 416363.6 |
| 2015 | 498000.0 |
| 2016 | 498571.4 |
| 2017 | 473222.2 |
| 2018 | 590000.0 |
| 2019 | 490000.0 |
| 2020 | 855000.0 |
In this next plot we summarize the mean prices of used VW cars across the manufacturing years 2014-2020. We find a sharp decrease in the prices of used VW cars after the first three years. There is a negligible difference between the mean prices of cars that are six or seven years old, but there is an appreciable drop in prices when the age reaches eight years. The biggest drop occurs when the car becomes three years old, which leads to a roughly 43% decrease in mean price in this sample (from about 855,000 for 2020 cars to 490,000 for 2019 cars). The next drop in value occurs when the car becomes more than five years old. The car can be found at its lowest value from eight years onwards.
polo1 |> mutate(year = as.factor(year)) |>
group_by(year) |>
summarise(pricesm = round(mean(prices),2), mileagem = mean(mileage)) |>
ggplot(aes(x = year, y = pricesm)) +
geom_col()+
geom_label(aes(label = pricesm))
If you are looking for a used VW car, it might be better to go for cars older than four years.
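The drops described above can be checked directly from the table of mean prices. The sketch below copies the seven means from that table and computes the percentage change between each manufacturing year and the next newer one; it is a back-of-the-envelope check on 40 listings, not a depreciation model.

```r
# Mean prices copied from the summary table above
mean_prices <- c(
  "2014" = 416363.6, "2015" = 498000.0, "2016" = 498571.4,
  "2017" = 473222.2, "2018" = 590000.0, "2019" = 490000.0,
  "2020" = 855000.0
)

# Order from newest to oldest, then compare each year with the next newer one
newest_first <- rev(mean_prices)
pct_change <- round(100 * diff(newest_first) / head(newest_first, -1), 1)

# Named by the older year of each pair, e.g. "2019" is the 2020 -> 2019 change
pct_change
```

The 2020 to 2019 step comes out at about -43%, while the 2019 to 2018 step is positive, a reminder that with so few listings per year the means are sensitive to which models and variants happen to be listed.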
Now that we have covered the VW cars, a question arises in my mind: how does the value retention of VW cars compare to that of Honda, Hyundai or Suzuki cars, for that matter? We can compare across three major car manufacturers and see whose cars hold the most value in the used car market over the years. First we need to download the relevant data for each manufacturer. Since we already have the “olxfind” function, we only need to change the name of the manufacturer and keep everything else the same.
polo1<- olxfind(area= "dehradun_g4059236", yearstart = "2012", yearend = "2020",make = "volkswagen")
honda<- olxfind(area= "dehradun_g4059236", yearstart = "2012", yearend = "2020",make = "cars-honda")
glimpse(honda)
Rows: 40
Columns: 4
$ year <dbl> 2016, 2014, 2017, 2019, 2014, 2018, 2017, 2013, 2018…
$ mileage <dbl> 52000, 82000, 46852, 16100, 1, 55000, 16300, 58000, …
$ prices <dbl> 725000, 500000, 660000, 725000, 480000, 700000, 5350…
$ carname <chr> "Honda City", "Honda City", "Honda BR-V", "Honda Ama…
suzuki<-olxfind(area= "dehradun_g4059236", yearstart = "2012", yearend = "2020",make = "maruti-suzuki")
glimpse(suzuki)
Rows: 40
Columns: 4
$ year <dbl> 2015, 2017, 2019, 2018, 2017, 2017, 2018, 2019, 2018…
$ mileage <dbl> 63328, 39549, 22000, 50000, 20800, 22000, 29000, 130…
$ prices <dbl> 520000, 425000, 625000, 370000, 625000, 325000, 4300…
$ carname <chr> "Maruti Suzuki Swift Dzire", "Maruti Suzuki Wagon R"…
# toyota<-olxfind(area= "dehradun_g4059236", yearstart = "2012", yearend = "2020",make = "toyota")
# glimpse(toyota)
Now we have three data frames, one for each car manufacturer, and we need to join them together by rows. But before doing that, we need to add a column called manufacturer to each data frame so that we can tell them apart once they are joined.
honda<- honda |> mutate(manufacturer = "honda")
polo1<- polo1 |> mutate(manufacturer = "VW")
suzuki<- suzuki |> mutate(manufacturer = "suzuki")
#toyota<- toyota |> mutate(manufacturer = "toyota")
# I know I should have written a function for it or a loop
glimpse(honda)
Rows: 40
Columns: 5
$ year <dbl> 2016, 2014, 2017, 2019, 2014, 2018, 2017, 2013,…
$ mileage <dbl> 52000, 82000, 46852, 16100, 1, 55000, 16300, 58…
$ prices <dbl> 725000, 500000, 660000, 725000, 480000, 700000,…
$ carname <chr> "Honda City", "Honda City", "Honda BR-V", "Hond…
$ manufacturer <chr> "honda", "honda", "honda", "honda", "honda", "h…
Now let’s join all three data sets with the bind_rows command from dplyr.
# Now we shall bind all three data frames by their rows
joineddata<- bind_rows(honda, polo1, suzuki)
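The tagging and binding can also be done in one step, without a separate mutate per data frame. In this minimal sketch, a named list is passed to bind_rows and the list names become the manufacturer column through the .id argument. The small tibbles in scraped are stand-ins built from a few rows of the outputs shown earlier, not the full scraped data.

```r
library(dplyr)

# Stand-ins for the scraped tibbles (a few rows copied from the outputs above)
scraped <- list(
  honda  = tibble(year = c(2016, 2014), prices = c(725000, 500000)),
  VW     = tibble(year = c(2016, 2014), prices = c(600000, 395000)),
  suzuki = tibble(year = 2015, prices = 520000)
)

# The list names are written into a new "manufacturer" column
joined <- bind_rows(scraped, .id = "manufacturer")
joined
```

With the real data, the list would hold the results of the three olxfind calls, and adding another manufacturer only needs one more list element.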
Before proceeding we will check that all the car manufacturers are there.
joineddata<- joineddata |> mutate(across(where(is.character), as.factor))
joineddata|> group_by(carname, manufacturer) |>
summarise(prices = mean(prices),
count = n()) |>
ggplot(aes(y = carname, x = prices, label = count)) +
geom_col() +
geom_label() +
facet_wrap(~manufacturer)
Some of the cars are more expensive than others. We can notice that the Honda WR-V, and
Now that we have the data of all three car manufacturers, we can create some comparative visualisations that will help us answer some questions about value retention in the used car market.
plot2<- joineddata |>
mutate(year = as.factor(year)) |>
group_by(year, manufacturer) |>
summarise(prices = mean(prices),
mileage = mean(mileage),
count = n()) |>
mutate(prices = round(prices,1)) |>
ggplot(aes(x = year, y = prices, label = count)) +
geom_col(aes(fill = manufacturer), position = "dodge") +
coord_flip()+
scale_y_continuous(n.breaks = 10) +
theme(legend.position = "bottom")
# geom_label(aes(label = prices), nudge_y = 300, check_overlap = TRUE)
ggplotly(plot2)
plot2<- joineddata |>
mutate(year = as.factor(year)) |>
group_by(year, carname) |>
summarise(prices = mean(prices),
mileage = mean(mileage),
count = n()) |>
mutate(prices = round(prices,1)) |>
filter(carname %in% c("Volkswagen Polo", "Honda City", "Maruti Suzuki Swift", "Maruti Suzuki Swift Dzire", "Honda Amaze", "Volkswagen Ameo")) |>
ggplot(aes(x = year, y = prices, label = count)) +
geom_col(aes(fill = carname), position = "dodge") +
coord_flip()+
scale_y_continuous(n.breaks = 10) +
theme(legend.position = "bottom")
# geom_label(aes(label = prices), nudge_y = 300, check_overlap = TRUE)
ggplotly(plot2)
This is very interesting! We can see that the Honda cars are more expensive than the Suzuki and VW cars even after three years. This is probably because Honda sells mostly more expensive, premium models compared to Suzuki and Volkswagen. As we move through the years, we can see that the Suzuki and VW cars show the sharpest decline in the first five years. Then something strange happens: six and seven year old Suzuki and VW cars are priced higher than five year old ones. In fact, six year old cars cost as much as three year old cars. This is probably due to a disparity in the data, wherein most of the higher-end variants and models are listed after five years, while most people listing their cars after three years are listing lower-level variants. If we compare this with the data for used Honda cars, it appears that the prices of Honda cars decrease steadily in the used car market and are therefore more predictable than those of VW and Suzuki cars.
Keep this in mind when choosing your next used vehicle. I hope you found this article insightful. If you have any suggestions, please feel free to share them in the comments section.